Introduction

Data Source [1]

The dataset is related to the red variant of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

Attribute information

  • Number of items: 1599.
  • Number of attributes: 13 (11 input variables, 1 output variable, and a sequential ID column, X).
Input variables (based on physicochemical tests)
  1. fixed acidity (tartaric acid - g/dm3)
  2. volatile acidity (acetic acid - g/dm3)
  3. citric acid (g/dm3)
  4. residual sugar (g/dm3)
  5. chlorides (sodium chloride - g/dm3)
  6. free sulfur dioxide (mg/dm3)
  7. total sulfur dioxide (mg/dm3)
  8. density (g/cm3)
  9. pH
  10. sulphates (potassium sulphate - g/dm3)
  11. alcohol (% by volume)

Description of attributes

  1. fixed.acidity: most acids involved with wine are fixed or nonvolatile (they do not evaporate readily);
  2. volatile.acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste;
  3. citric.acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines;
  4. residual.sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet;
  5. chlorides: the amount of salt in the wine;
  6. free.sulfur.dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine;
  7. total.sulfur.dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine;
  8. density: the density of wine is close to that of water, depending on the percent alcohol and sugar content;
  9. pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale;
  10. sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant;
  11. alcohol: the percent alcohol content of the wine;
Output variable (based on sensory data)
  1. quality (score between 0 and 10).
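
As a rough illustration of the pH scale described above, the sketch below converts pH values to hydrogen ion concentration using the standard definition pH = -log10([H+]). The helper name `h_concentration` is only illustrative and is not part of the original analysis.

```r
# Hydrogen ion concentration (mol/L) implied by a pH value,
# using the standard definition pH = -log10([H+]).
# `h_concentration` is a hypothetical helper name.
h_concentration <- function(ph) 10^(-ph)

# Most wines sit between pH 3 and pH 4 (see the attribute description).
ratio <- h_concentration(3) / h_concentration(4)
ratio  # one pH unit corresponds to a tenfold change in [H+]
```

Because the scale is logarithmic, small differences in pH between wines correspond to proportionally larger differences in acidity.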

Descriptive Statistics

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Quality grades distribution:

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

As we can see in the statistics summary above, no wines have a quality value smaller than 3 or bigger than 8. We can also see that the quality value is discrete, so it should be treated as an ordinal variable.
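
The summary figures for quality can be cross-checked from the grade counts alone. The sketch below rebuilds the quality vector from the distribution table above and converts it to an ordered factor; the variable names (`counts`, `quality`, `quality_ord`) are my own choices, and the original data file is not read here.

```r
# Grade counts copied from the "Quality grades distribution" table.
counts <- c(`3` = 10, `4` = 53, `5` = 681, `6` = 638, `7` = 199, `8` = 18)

# Rebuild one quality value per wine from the counts.
quality <- rep(as.integer(names(counts)), times = counts)

length(quality)  # 1599 wines
mean(quality)    # about 5.636, matching the summary
median(quality)  # 6, matching the summary

# Treat quality as ordinal: an ordered factor with levels 3 < 4 < ... < 8.
quality_ord <- factor(quality, levels = 3:8, ordered = TRUE)
table(quality_ord)
```

An ordered factor keeps the grade ordering for plots and comparisons while making clear that the distances between grades are not assumed to be equal.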


Univariate Plots Section

Let’s start by looking at the distributions of some of the variables:

Taking a first look at the graphs, we can see that quality has a roughly normal distribution. The same happens with pH and density. The distribution is mostly right-skewed for all other attributes, which seems to point to consistently low concentrations of those attributes.

Specifically, let’s take a look at the sulfites and sulphates distributions:

As we can see, most wines have a low concentration of those compounds, with just a few of them having a higher amount of sulphates and sulfites.


Univariate Analysis

What is the structure of your dataset?

There are 1599 wines in the dataset with 12 features (as seen above). The variable quality is discrete and varies from 0 to 10, but in this dataset, the minimum is 3 and the maximum 8.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in this dataset is the quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

The other features will be used to investigate their influence on the perceived quality of the wine, especially the ones that relate to the perceived flavor (like volatile acidity, residual sugar and chlorides).

The presence of sulphates and SO2 (sulfites) is also evaluated. Sulfites are generated by the fermentation and aging processes, and may taint the wine flavor. One common way to balance this effect is to add sulphate, usually Copper Sulfate (CuSO4), to reduce the formation of sulfites. I will investigate how the levels of sulfites and sulphates affect the perceived quality of the wine.


Bivariate Plots Section

## `geom_smooth()` using method = 'gam'

Correlation between Density and Alcohol level:

## 
##  Pearson's product-moment correlation
## 
## data:  density and alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

Here, we start by evaluating how the alcohol percentage affects the density. The graph and the correlation index are consistent with:

  • Beverages are mostly water;
  • Water is more dense than alcohol;
  • Higher percentages of alcohol make a wine less dense.
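
To make the test output above easier to read, the sketch below recomputes the t statistic and the 95 percent confidence interval from the reported correlation and sample size alone. It uses the usual formulas for Pearson's test (a t statistic with n - 2 degrees of freedom, and a Fisher z-transform interval); the variable names are my own.

```r
# Values taken from the density/alcohol test output above.
r <- -0.4961798  # sample correlation
n <- 1599        # number of wines
df <- n - 2      # degrees of freedom (1597)

# t statistic for H0: true correlation equals 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transform.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)
round(ci, 4)
```

The recomputed values should agree with the reported t and interval up to rounding, which is a quick way to confirm that each output block belongs to the pair of variables it is labelled with.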

Correlation between Quality and Alcohol level:

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(quality) and alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

We can see that the best-evaluated wines have consistently higher levels of alcohol, and that quality grade 5 shows a high number of outliers. The correlation test points in the same direction.
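
The test output above is labelled `as.numeric(quality)`, which suggests that quality was stored as a factor (this is an assumption; the output does not confirm it). In that case `as.numeric()` returns the level codes (1 to 6) rather than the grades (3 to 8). The sketch below, on made-up toy data, shows that Pearson's correlation is unchanged by this shift, and also computes Spearman's rank correlation as an alternative for an ordinal variable.

```r
# Toy data, invented for illustration only (not taken from the dataset).
quality_toy <- factor(c(3, 5, 5, 6, 7, 8), levels = 3:8, ordered = TRUE)
alcohol_toy <- c(9.0, 9.5, 10.1, 10.8, 11.6, 12.4)

codes  <- as.numeric(quality_toy)                # level codes: 1, 3, 3, 4, 5, 6
grades <- as.numeric(as.character(quality_toy))  # actual grades: 3, 5, 5, 6, 7, 8

# Pearson's r is invariant to adding a constant, so both agree.
same_r <- isTRUE(all.equal(cor(codes, alcohol_toy), cor(grades, alcohol_toy)))
same_r

# Rank-based alternative that only uses the ordering of the grades.
rho <- cor(grades, alcohol_toy, method = "spearman")
rho
```

The codes only differ from the grades by a constant here because every level from 3 to 8 is present; with gaps in the levels the two would no longer be linearly related, so converting through `as.character()` is the safer habit.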

Correlation between Quality and pH level:

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(quality) and pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139

We can see that the pH level has little to no effect on the perceived quality of the wine: the correlation is statistically significant at the 5% level (p = 0.021) but very weak (r = -0.058).
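
The pH result is a good case for separating statistical significance from effect size. The sketch below recomputes the two-sided p-value from the reported t statistic and degrees of freedom, and then the share of variance that a linear relationship of this strength would account for (r squared). Variable names are my own.

```r
# Values taken from the quality/pH test output above.
t_stat <- -2.3109
df     <- 1597
r      <- -0.05773139

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

# Proportion of variance in one variable linearly associated with the other.
r_squared <- r^2

round(p_value, 5)
round(r_squared, 4)
```

With about 1600 observations even a very small correlation can fall below the 5% threshold, so r squared (well under 1% here) is the more informative number for judging practical relevance.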

Correlation between Quality and Citric Acid level:

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(quality) and citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

The graph and the calculations seem to indicate a small correlation between the citric acid amount and the perceived quality of the wine.

Correlation between Quality and Residual Sugar level:

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(quality) and residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164

The quality also seems to not be affected by the residual sugar, but there are several outliers in this case.

Correlation between Quality and Sulphates level:

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(quality) and sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

The quality seems to be slightly affected by the presence of sulphates, and there are several outliers in the 5-6 quality range.

Correlation between Quality and Volatile Acidity level:

## 
##  Pearson's product-moment correlation
## 
## data:  as.numeric(quality) and volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

The volatile acidity is negatively correlated to the quality - the less, the better.


Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Our main investigation was about how the quality is affected by several attributes in the dataset. As can be seen above, some variables affect the perceived quality positively (alcohol or sulphates), some negatively (volatile acidity), and some seem not to affect it at all (residual sugar).

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

The relationship between the density and the alcohol level seems strong: the more alcohol, the less dense the wine. This makes sense because, like any beverage, wines are mostly water, and alcohol is less dense than water, so more alcohol means a lower total density.

What was the strongest relationship you found?

The strongest relationship was between density and alcohol percentage (r = -0.50). Among the relationships involving wine quality, the strongest was between alcohol percentage and perceived quality (r = 0.48).
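
To compare the quality relationships side by side, the sketch below collects the correlation coefficients reported in the bivariate section and ranks them by absolute value. The numbers are copied from the test outputs above; the vector and variable names are my own.

```r
# Pearson correlations with quality, copied from the outputs above.
r_quality <- c(
  alcohol          =  0.4761663,
  pH               = -0.05773139,
  citric.acid      =  0.2263725,
  residual.sugar   =  0.01373164,
  sulphates        =  0.2513971,
  volatile.acidity = -0.3905578
)

# Rank by strength of association, ignoring the sign.
ranked <- r_quality[order(abs(r_quality), decreasing = TRUE)]

# Small comparison table: coefficient and squared coefficient.
data.frame(r = round(ranked, 3), r_squared = round(ranked^2, 3))
```

Ranking by absolute value puts negative relationships such as volatile acidity on the same footing as positive ones, which makes the ordering easier to read than the sequence in which the tests were run.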


Multivariate Plots Section

The amount of Free SO2 is consistent with the amount of Total SO2 in the studied wines. The number of wines showing high level of sulphates is small.

Consistent with the previous graph, we can see that most wines have small concentrations of sulphates, and that this does not considerably affect the total SO2.

We can see a strong correlation between the pH level and the amount of citric acid in the wine. However, the pH level seems to affect the perceived quality less than the citric acid does.


Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

As we can see, most wines have low levels of both free and total SO2, and seem to use small amounts of sulphates. Few wines have high levels of sulphates, which may indicate good control over the aging process by the producers.

Also, the expected relation between pH and citric acid is present (lower pH = higher acidity). And we can see that the quality of the wine is affected by the level of Citric Acid, but not so much by the pH level.

Were there any interesting or surprising interactions between features?

I was expecting to see smaller levels of added sulphates in higher quality wines, which could indicate a better or more traditional aging process. The data suggests the opposite: higher quality wines have higher sulphate levels.


Final Plots and Summary

Plot One

Description One

As explained in the dataset description, we can see that there’s a weak correlation between the perceived quality and the citric acid level. Interestingly, this relationship does not extend to the pH level: not all wines with a high citric acid level have a low pH.

Plot Two

Description Two

The residual sugar does not seem to affect the perceived quality. This is somewhat interesting, because vinho verde is not a sweet wine (which could have led the sweeter ones to receive a poor evaluation).

Plot Three

Description Three

Of all the parameters evaluated, this one seems interesting to me - the correlation between quality and alcohol level. There’s a somewhat strong correlation between the alcohol level and the perceived quality of the wine. Maybe the reviewers were more interested in the effects of wine than flavors? :)


Reflection

The redwine dataset contains 1599 observations across 12 variables [2], from sometime around 2009, for the red vinho verde wine. My initial approach was to look at the variable names and their summary statistics to identify interesting values for study.

My main focus was the quality variable, which is defined as the median of at least three evaluations made by wine experts. I tried to investigate which chemical characteristics were consistent with the grades.

The main challenge I found was understanding some of the chemical processes used in wine production, such as the addition of sulphates to improve the aging process. Also, I was under the impression that sweeter wines would be rated worse than drier ones, and the data does not support this point of view.

Several factors seem to affect the perceived quality, some of them positively, some not. Among the positive ones, citric acid and alcohol were the most prominent, while volatile acidity negatively affects quality.

Some limitations need to be considered: the reviews were made by a small set of reviewers, there is no information about the methodology adopted in these reviews, and the dataset only covers a specific kind of wine, the red variety of vinho verde, without considering the differences within it (type of grape, for instance).


  1. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties.
    In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.
    Available at: http://dx.doi.org/10.1016/j.dss.2009.05.016
    BibTeX: http://www3.dsi.uminho.pt/pcortez/dss09.bib

  2. The first variable in the dataset (X) is just a sequential ID for each observation, and was ignored.